Executive Summary

The purpose of this vignette is to explore how risk relates to other characteristics of World Bank projects.

Key takeaways

  • Risk is somewhat related to the region in which a project is implemented. However, each region contributes to the overall risk in a different way. Some do not follow a distinct trend. Others are more likely to involve high-risk projects (Africa), or moderate risk (Europe and Central Asia). We explore this in detail.

  • There are other, stronger predictors of the overall risk rating, such as the tenure of the project’s task team leader, or TTL (experienced ones are more likely to be trusted with high-risk projects), the scale of the project (as represented by the amounts committed), and the year in which the project was approved (as a proxy for world events, changes in the Bank’s risk tolerance, etc.).

Preparation of the dataset

The following packages are used:

library(exploratory)
library(janitor)
library(lubridate)
library(hms)
library(tidyr)
library(stringr)
library(readr)
library(forcats)
library(RcppRoll)
library(dplyr)
library(tibble)
library(rio)
library(plotly)
library(reshape2)
library(alluvial)
library(caret)
library(corrplot)
library(vcd)
library(graphics)

Tidy data

The provided datasets — project_data and risk_data — come in wide and long formats, respectively.

We bring these datasets together in a tidy format, where each column is a variable, and each row is a unique observation. In this case, the unique identifier is project_id.

We also recode risk_rating as a number, overwriting the original letter codes, to facilitate computation in R. The risk can be Low (1), Moderate (2), Substantial (3), or High (4).

# Encoding risk: L = 1, M = 2, S = 3, H = 4; any other code becomes 0
risk_data <- risk_data %>%
  mutate(risk_rating =
    ifelse(risk_rating == "L", 1,
    ifelse(risk_rating == "M", 2,
    ifelse(risk_rating == "S", 3,
    ifelse(risk_rating == "H", 4, 0)))))

# Reshaping the data from long to wide: one column per risk type
risk_data_wide <- dcast(risk_data,
                        project_id + risk_rating_sequence ~ risk_rating_code,
                        value.var = "risk_rating",
                        fun.aggregate = mean)

# Joining the data and dropping variables that have no variation
joined_data <- project_data %>%
  inner_join(risk_data_wide, by = "project_id") %>%
  select(-scale_up, -len_instr_type)
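The encoding and reshaping steps can be tried on a toy example without the original files. The sketch below is a base-R illustration, not the vignette's own code: the two-project data frame, the risk_political code, and the risk_numeric column are hypothetical. Unlike the block above, unknown rating codes become NA rather than 0, so they cannot pull down an average.

```r
# Toy long-format risk data (hypothetical values, for illustration only)
risk_long <- data.frame(
  project_id           = c("P1", "P1", "P2", "P2"),
  risk_rating_sequence = c(1, 1, 1, 1),
  risk_rating_code     = c("risk_overall", "risk_political",
                           "risk_overall", "risk_political"),
  risk_rating          = c("S", "H", "M", "X"),  # "X" is an unknown code
  stringsAsFactors     = FALSE
)

# Named lookup vector: indexing by the letter returns the numeric level,
# and any letter that is not in the lookup returns NA
rating_levels <- c(L = 1, M = 2, S = 3, H = 4)
risk_long$risk_numeric <- unname(rating_levels[risk_long$risk_rating])

# Long to wide: one row per project and sequence, one column per risk type
risk_wide <- reshape(
  risk_long[, c("project_id", "risk_rating_sequence",
                "risk_rating_code", "risk_numeric")],
  idvar     = c("project_id", "risk_rating_sequence"),
  timevar   = "risk_rating_code",
  direction = "wide"
)
risk_wide
```

The wide columns are named after the risk codes with a risk_numeric. prefix, which is a side effect of base reshape() and not something dcast() would produce.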

Part 1. Risk and Regions

Interactive Map for Exploratory Data Analysis

The main purpose of this tool is to allow quick summaries by time period, for a particular type of risk or for all of them.

It allows you to slice, aggregate, and visualize data in real time.

Distribution of Risk


Assumption 1: The analysis below was performed on the entire dataset. Risks such as political, governance, and macroeconomic risk may change abruptly once a new administration is in place. The resulting insights should therefore be treated as generalizations from historical data.

Assumption 2: All risk evaluations are performed by staff, subjectively. Overall risk category is a qualitative assessment of the risk, which is not a linear function of other types of risk.

Assumption 3: When making a decision on a project, Overall risk is the key factor. We primarily focus on this type of risk in this vignette. The relationships between region and the subcategories of risk are presented in the correlation matrix.


We start with an alluvial plot to visualize the many relationships between overall risk and regions.

# Prepare data for visualization 
joined_data_freq <- joined_data %>% 
  group_by(risk_overall, region, fcs_indicator, proj_emrg_recvry_flg) %>% 
  summarise(freq=n()) %>% 
  filter(region != "OTH") %>% 
  arrange(desc(region))

# Create an alluvial chart
alluvial(joined_data_freq[,1:4], 
         freq=joined_data_freq$freq, 
         border=NA,
         hide = joined_data_freq$freq < quantile(joined_data_freq$freq, 0.5),
         col=ifelse(joined_data_freq$risk_overall == "4", "red", 
             ifelse(joined_data_freq$risk_overall == "3", "orange", 
             ifelse(joined_data_freq$risk_overall == "2", "cyan", "blue"))))

The above figure shows – in color – how different levels of risk are distributed across regions (including those facing fragile, conflict, or emergency situations). These situations have been chosen to accompany regions because they are intrinsically related to specific geographies.

It can be seen that:

  • A little over 50% of projects have a Substantial (3) risk rating. The second most frequent rating is Moderate (2). This suggests that most risk assessors refrain from making extreme judgements.
  • It is therefore not surprising that, in most regions, ratings cluster around the middle categories in a roughly bell-shaped pattern, with Low (1) and High (4) in the tails. The percentage breakdown of overall risk by region can be found below.
  • Africa is the region with the most projects. Over 2/3 of the projects there carry a Substantial (3) or High (4) risk rating.
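The percentage breakdown can be recomputed from the region-by-risk counts reported in the Contingency Analysis section. The sketch below is a minimal illustration that types those counts in by hand (the object names conting and row_pct are ours) and converts each row to percentages with prop.table().

```r
# Region-by-risk counts, copied from the contingency table in this vignette
conting <- matrix(
  c(20, 102, 236, 79,
     4,  68, 131, 18,
     9,  90,  68, 12,
     4,  71,  71, 10,
     3,  21,  32, 18,
     0,   0,   2,  0,
     7,  61, 103, 22),
  nrow = 7, byrow = TRUE,
  dimnames = list(region = c("AFR", "EAP", "ECA", "LCR", "MNA", "OTH", "SAR"),
                  risk_overall = c("1", "2", "3", "4"))
)

# margin = 1 normalizes within each row, so every region sums to 100%
row_pct <- round(100 * prop.table(conting, margin = 1), 1)
row_pct

# Share of all projects in each risk category
round(100 * colSums(conting) / sum(conting), 1)
```

Row percentages answer "within a region, how are ratings distributed?", which is the comparison the bullets above rely on.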

The takeaway of this table is that the proportions of projects in each risk category look broadly similar across regions. This suggests that risk is not substantially or exclusively related to the region in which a project is implemented.

Let’s see whether this holds up statistically.

Contingency Analysis

In this segment we use the chi-squared test of independence. It can be applied to a frequency table formed by two categorical variables: risk_overall and region in this case.

conting_risk_reg <- table(joined_data$region, joined_data$risk_overall)
conting_risk_reg
##      
##         1   2   3   4
##   AFR  20 102 236  79
##   EAP   4  68 131  18
##   ECA   9  90  68  12
##   LCR   4  71  71  10
##   MNA   3  21  32  18
##   OTH   0   0   2   0
##   SAR   7  61 103  22
chisq_test_risk_reg <- chisq.test(conting_risk_reg)
chisq_test_risk_reg
## 
##  Pearson's Chi-squared test
## 
## data:  conting_risk_reg
## X-squared = 87.094, df = 18, p-value = 4.782e-11

With a p-value < 0.05 we reject the null hypothesis of the test. We can therefore say that there is a statistically significant association between region and level of risk. In other words, region and risk are not independent, but somewhat associated.
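To make the test less of a black box, the statistic can be rebuilt by hand from the printed table. The sketch below is an illustration only (the object names are ours): it computes the expected count of each cell under independence as row total times column total divided by the grand total, then sums the squared deviations.

```r
# Observed counts, copied from the contingency table above
observed <- matrix(
  c(20, 102, 236, 79,
     4,  68, 131, 18,
     9,  90,  68, 12,
     4,  71,  71, 10,
     3,  21,  32, 18,
     0,   0,   2,  0,
     7,  61, 103, 22),
  nrow = 7, byrow = TRUE,
  dimnames = list(region = c("AFR", "EAP", "ECA", "LCR", "MNA", "OTH", "SAR"),
                  risk_overall = c("1", "2", "3", "4"))
)

# Expected counts if region and risk were independent
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)

# Pearson chi-squared statistic, degrees of freedom, and p-value
x_squared <- sum((observed - expected)^2 / expected)
df        <- (nrow(observed) - 1) * (ncol(observed) - 1)
p_value   <- pchisq(x_squared, df = df, lower.tail = FALSE)

c(x_squared = x_squared, df = df, p_value = p_value)
```

The expected counts in the OTH row are all well below 5, so the chi-squared approximation is weakest there. Dropping that two-project row before testing would be a reasonable robustness check.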

mosaicplot(conting_risk_reg, shade = TRUE,  
           las=2, cex.axis = 1,
           main = "Mosaic Plot of Risk-Region Relationship")

The colors show how much each cell contributes to the relationship between the variables. Blue means there are more observations in that cell than would be expected under the null hypothesis of independence in the chi-squared test above. Red means there are fewer observations than expected. The red and blue cells are the reason the null hypothesis of the chi-squared test is rejected.
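The shading can also be read as numbers. The sketch below is illustrative and rebuilds the table by hand. It pulls the Pearson residuals, (observed - expected) / sqrt(expected), from chisq.test() and lists the cells whose absolute residual exceeds 2, the first cutoff that mosaicplot() uses by default when shade = TRUE.

```r
# Observed counts, copied from the contingency table above
observed <- matrix(
  c(20, 102, 236, 79,
     4,  68, 131, 18,
     9,  90,  68, 12,
     4,  71,  71, 10,
     3,  21,  32, 18,
     0,   0,   2,  0,
     7,  61, 103, 22),
  nrow = 7, byrow = TRUE,
  dimnames = list(region = c("AFR", "EAP", "ECA", "LCR", "MNA", "OTH", "SAR"),
                  risk_overall = c("1", "2", "3", "4"))
)

# chisq.test() warns about the sparse OTH row; the residuals are still defined
pearson_res <- suppressWarnings(chisq.test(observed))$residuals
round(pearson_res, 2)

# Cells that would be shaded: positive = more than expected, negative = fewer
shaded <- as.data.frame(as.table(pearson_res), responseName = "residual")
shaded[abs(shaded$residual) > 2, ]
```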

Conclusion

Our results show that:

  • Africa is strongly associated with a high level of risk, and hosts moderate-risk projects much less frequently than expected.

  • Europe and Central Asia is the most likely to have moderate-risk projects, and is unlikely to host substantial- or high-risk projects.

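  • Latin America and Caribbean projects lean toward moderate risk: in the contingency table the region has more moderate-risk projects than independence would predict (71 observed against about 51 expected) and fewer high-risk projects (10 against about 20).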

  • Middle East and North Africa is a place where high risk projects are more likely.

  • East Asia and Pacific, South Asia, and other regions do not have strong associations with a particular level of risk.


Part 2. Correlates of Risk

Correlation Matrix

To better understand the relationship between risk and other variables, we look for more patterns in the data.

Let’s create a correlation matrix that will help us understand which variables are similar based on how the underlying data varies.

# Performing one-hot encoding of the region and other variables. The purpose is to transform each value of each categorical feature into a binary feature {0, 1}
j_data_corr <- as.data.frame(joined_data)

j_data_corr$region <- as.factor(j_data_corr$region)
for(level in unique(j_data_corr$region)){
  j_data_corr[paste("reg", level, sep = "_")] <- ifelse(j_data_corr$region == level, 1, 0)
}

j_data_corr$fcs_indicator <- ifelse(j_data_corr$fcs_indicator == "Y", 1, 0)
j_data_corr$proj_emrg_recvry_flg <- ifelse(j_data_corr$proj_emrg_recvry_flg == "Y", 1, 0)

j_data_corr <- j_data_corr %>% select(-tl) 
# Running Spearman correlation analysis
var_corrs <- j_data_corr %>% 
  do_cor(which(sapply(., is.numeric)), 
         use = "pairwise.complete.obs", 
         method = "spearman", 
         distinct = FALSE, 
         diag = TRUE)
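The same idea can be reproduced with base R only. The sketch below is an illustration under stated assumptions, not the do_cor() pipeline above: it uses model.matrix() for the one-hot encoding and cor() for the Spearman coefficients, and it is limited to region versus overall risk because those are the only project-level values that can be recovered from the contingency table in Part 1.

```r
# Region-by-risk counts, copied from the contingency table in Part 1
counts <- matrix(
  c(20, 102, 236, 79,
     4,  68, 131, 18,
     9,  90,  68, 12,
     4,  71,  71, 10,
     3,  21,  32, 18,
     0,   0,   2,  0,
     7,  61, 103, 22),
  nrow = 7, byrow = TRUE,
  dimnames = list(region = c("AFR", "EAP", "ECA", "LCR", "MNA", "OTH", "SAR"),
                  risk_overall = c("1", "2", "3", "4"))
)

# Expand the counts back into one row per project
long <- as.data.frame(as.table(counts), stringsAsFactors = FALSE)
projects <- long[rep(seq_len(nrow(long)), long$Freq), c("region", "risk_overall")]
projects$risk_overall <- as.integer(projects$risk_overall)

# One-hot encode region: "- 1" drops the intercept so every level gets a column
dummies <- model.matrix(~ region - 1, data = projects)
colnames(dummies) <- sub("^region", "reg_", colnames(dummies))

# Spearman correlation of each region dummy with the overall risk rating
region_cors <- cor(dummies, projects$risk_overall, method = "spearman")
round(region_cors, 2)
```

Because each dummy is binary and risk has only four levels, the ranks are heavily tied, so these coefficients are best read for sign and relative size rather than as precise effect sizes.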

Speaking of regions and subcategories of risk, the strongest relationships are:

  • The 0.25 correlation between the Africa region and political and governance risk hints at instability in the region.

  • This stands out against the -0.28 correlation between East Asia and Pacific and the same risk.

  • A similar pattern is observed for macroeconomic risk in these two regions.

  • Other regions do not show as strong a relationship with any type of risk. This is in line with our observations from Part 1.

A handful of other insights can be extracted from this plot:

  • The -0.29 correlation between fcs_indicator and net_commit_amt, the net value of World Bank loans. It is hard to provide loans when it is uncertain whether the borrowers will still be in place once a situation is resolved.

  • The -0.95 correlation between approval_fy and risk_rating_sequence is trivial: more recently approved projects have had fewer opportunities to be reassessed later in their lifecycle.

  • The -0.62 correlation between net_commit_amt and grant is straightforward as well: the Bank tends to provide one type of aid or the other, either a loan or a grant, although some exceptions apply.

With these preliminary results in hand, we proceed to a more advanced analysis.

Modelling with Extreme Gradient Boosted Trees

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   1   2   3   4
##          1   0   4   0   0
##          2   5  52  47   5
##          3   6  71 140  18
##          4   0   1  12  17
## 
## Overall Statistics
##                                           
##                Accuracy : 0.5529          
##                  95% CI : (0.5012, 0.6038)
##     No Information Rate : 0.5265          
##     P-Value [Acc > NIR] : 0.1639          
##                                           
##                   Kappa : 0.2106          
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: 1 Class: 2 Class: 3 Class: 4
## Sensitivity           0.00000   0.4062   0.7035  0.42500
## Specificity           0.98910   0.7720   0.4693  0.96154
## Pos Pred Value        0.00000   0.4771   0.5957  0.56667
## Neg Pred Value        0.97059   0.7175   0.5874  0.93391
## Prevalence            0.02910   0.3386   0.5265  0.10582
## Detection Rate        0.00000   0.1376   0.3704  0.04497
## Detection Prevalence  0.01058   0.2884   0.6217  0.07937
## Balanced Accuracy     0.49455   0.5891   0.5864  0.69327

We built a model of overall risk; the results above are for the 30% of the data that was held out for testing.

The results show that XGBoost reaches a balanced accuracy of 0.69 for High risk (level 4). For levels 2 and 3 the balanced accuracy is about 0.59, better than the 0.5 expected from random assignment, while for level 1 it is about 0.49. Overall accuracy is 0.55, which is not significantly above the no-information rate of 0.53 (p = 0.16), so the model is best read as exploratory. The result is still useful because we are most concerned with the higher levels of risk: in this setting it is more valuable to flag high risks, at the cost of some false alarms, than to miss a high risk completely. Note, however, that sensitivity for High risk is 0.43, so more than half of the high-risk projects in the test set are still missed. Moreover, the test set contains only 11 cases of risk level 1, which makes that class hard for the algorithm to learn.
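The per-class figures in the output can be recomputed directly from the confusion matrix, which helps to see what "balanced accuracy" means. The sketch below is an illustration that types the matrix in by hand (the object names are ours) and treats each class in turn as the positive class, one versus the rest.

```r
# Confusion matrix copied from the output above: rows = predicted, cols = actual
cm <- matrix(
  c(0,  4,   0,  0,
    5, 52,  47,  5,
    6, 71, 140, 18,
    0,  1,  12, 17),
  nrow = 4, byrow = TRUE,
  dimnames = list(Prediction = as.character(1:4),
                  Reference  = as.character(1:4))
)

n        <- sum(cm)
accuracy <- sum(diag(cm)) / n   # share of all test projects predicted correctly

# One-vs-rest counts for each class
tp <- diag(cm)                  # predicted k, actually k
fn <- colSums(cm) - tp          # actually k, predicted something else
fp <- rowSums(cm) - tp          # predicted k, actually something else
tn <- n - tp - fn - fp

sensitivity       <- tp / (tp + fn)
specificity       <- tn / (tn + fp)
balanced_accuracy <- (sensitivity + specificity) / 2

accuracy
round(rbind(sensitivity, specificity, balanced_accuracy), 4)
```

Balanced accuracy averages sensitivity and specificity, so a class with very high specificity, such as level 4, can score well even when fewer than half of its true cases are detected.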

Variable Importance

# Keep small importance values out of scientific notation in the plot and tooltips
options(scipen = 100, digits = 4)

# Plot only the features above an importance cutoff, ordered by importance
p <- m_xgb_coef %>% filter(importance > 0.00683778) %>%
  ggplot(aes(x = reorder(feature, importance),
             y = importance,
             fill = importance)) +
    geom_col() +
    coord_flip() +
    scale_fill_gradient(name = "Variable\nimportance", low = "gray", high = "red") +
    theme_bw() +
    xlab("Variable names") +
    ylab("")
ggplotly(p, tooltip = c("y"))

According to the plot above, the most important features for predicting risk in this dataset fall into two clusters:

Higher importance:

  • tl_since: Date the current project manager came on to the project
  • net_commit_amt: Value of the World Bank loan(s) associated with the project in millions of USD.
  • approval_fy: Fiscal Year when the project was approved

Lower importance:

  • grant: Value of any World Bank grants associated with the project in millions of USD.
  • fcs_indicator: Indicates if the project is in a Fragile or Conflict Situation.
  • certain regions, countries, practices, etc.

This again supports our previous findings, showing that some regions are better predictors of risk than others. For example, the modelling exercise confirms the earlier observation that East Asia and Pacific is a stronger predictor than some other regions. A new piece of information, however, is that Europe and Central Asia is also noticeable.

There are many interesting attributes to explore, such as the tenure of the project’s TTL or the scale of the project. For now we focus on one, approval_fy, to see how project risk has changed over the years.

Once again, to be statistically rigorous, we use the chi-squared test of independence.

conting_risk_reg2 <- table(joined_data$approval_fy, joined_data$risk_overall)

assoc(conting_risk_reg2, shade=TRUE, legend = TRUE, compress = TRUE,
      rot_varnames = 0, rot_labels = 0,
      xlab="Year", offset_labels = c(1,1,1,1))

The p-value is near zero, so we reject independence: approval year and overall risk are associated.

Conclusion

Our results based on the data show that:

  • Years 2005 and 2007 were marked by increased attention to high-risk projects.

  • In 2008-2011 the World Bank was most likely to participate in moderate- and low-risk projects. 2009 was marked by a low likelihood of taking on high-risk projects, in the aftermath of the global financial crisis of 2007–2008.

  • In 2012-2014 no strong preference for any level of project risk can be found.

  • In 2015-2016 the World Bank was much more likely to support substantial-risk projects than moderate-risk ones.

End of analysis.